{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "21 - Export and run other machine learning models",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Pjmz-RORV8E"
      },
      "source": [
        "# Export and run other machine learning models\n",
        "\n",
        "txtai primarily has support for [Hugging Face Transformers](https://github.com/huggingface/transformers) and [ONNX](https://github.com/microsoft/onnxruntime) models. This enables txtai to hook into the rich model framework available in Python, export this functionality via the API to other languages (JavaScript, Java, Go, Rust) and even export and natively load models with ONNX.\n",
        "\n",
        "What about other machine learning frameworks? Say we have an existing TF-IDF + Logistic Regression model that has been well tuned. Can this model be exported to ONNX and used in txtai for labeling and similarity queries? Or what about a simple PyTorch text classifier? Yes, both of these can be done!\n",
        "\n",
        "With the [onnxmltools](https://github.com/onnx/onnxmltools) library, traditional models from [scikit-learn](https://scikit-learn.org/stable/), [XGBoost](https://xgboost.readthedocs.io/en/latest/) and others can be exported to ONNX and loaded with txtai. Additionally, Hugging Face's trainer module can train generic PyTorch modules. This notebook will walk through all these examples.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dk31rbYjSTYm"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XMQuuun2R06J"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline,similarity] datasets"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r6nmtieHdMfr"
      },
      "source": [
        "# Train a TF-IDF + Logistic Regression model\n",
        "\n",
        "For this example, we'll load the emotion dataset from Hugging Face datasets and build a TF-IDF + Logistic Regression model with scikit-learn.\n",
        "\n",
        "The emotion dataset has the following labels:\n",
        "\n",
        "- sadness (0)\n",
        "- joy (1)\n",
        "- love (2)\n",
        "- anger (3)\n",
        "- fear (4)\n",
        "- surprise (5)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pg9-tUxEdRfk"
      },
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.pipeline import Pipeline\n",
        "\n",
        "ds = load_dataset(\"emotion\")\n",
        "\n",
        "# Train the model\n",
        "pipeline = Pipeline([\n",
        "    ('tfidf', TfidfVectorizer()),\n",
        "    ('lr', LogisticRegression(max_iter=250))\n",
        "])\n",
        "\n",
        "pipeline.fit(ds[\"train\"][\"text\"], ds[\"train\"][\"label\"])\n",
        "\n",
        "# Determine accuracy on validation set\n",
        "results = pipeline.predict(ds[\"validation\"][\"text\"])\n",
        "labels = ds[\"validation\"][\"label\"]\n",
        "\n",
        "results = [results[x] == label for x, label in enumerate(labels)]\n",
        "print(\"Accuracy =\", sum(results) / len(ds[\"validation\"]))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Using custom data configuration default\n",
            "Reusing dataset emotion (/root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705)\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Accuracy = 0.8595\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "49jZD4jQgdBg"
      },
      "source": [
        "86% accuracy - not too bad! While we all get caught up in deep learning and advanced methods, good ole TF-IDF + Logistic Regression is still a solid performer and runs much faster. If that level of accuracy works, no reason to overcomplicate things."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zZtHxNSwFNGC"
      },
      "source": [
        "# Export and load with txtai\n",
        "\n",
        "The next section exports this model to ONNX and shows how the model can be used for similarity queries. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JBeScS5dFNeW",
        "outputId": "e1b5cbf4-87dd-4598-e7ee-14e36cf31a7c"
      },
      "source": [
        "from txtai.pipeline import Labels, MLOnnx, Similarity\n",
        "\n",
        "def tokenize(inputs, **kwargs):\n",
        "    if isinstance(inputs, str):\n",
        "        inputs = [inputs]\n",
        "\n",
        "    return {\"input_ids\": [[x] for x in inputs]}\n",
        "\n",
        "def query(model, tokenizer, multilabel=False):\n",
        "    # Load models into similarity pipeline\n",
        "    similarity = Similarity((model, tokenizer), dynamic=False)\n",
        "\n",
        "    # Add labels to model\n",
        "    similarity.pipeline.model.config.id2label = {0: \"sadness\", 1: \"joy\", 2: \"love\", 3: \"anger\", 4: \"fear\", 5: \"surprise\"}\n",
        "    similarity.pipeline.model.config.label2id = dict((v, k) for k, v in similarity.pipeline.model.config.id2label.items())\n",
        "\n",
        "    inputs = [\"that caught me off guard\", \"I didn t see that coming\", \"i feel bad\", \"What a wonderful goal!\"]\n",
        "    scores = similarity(\"joy\", inputs, multilabel)\n",
        "    for uid, score in scores[:5]:\n",
        "        print(inputs[uid], score)\n",
        "\n",
        "# Export to ONNX\n",
        "onnx = MLOnnx()\n",
        "model = onnx(pipeline)\n",
        "\n",
        "# Create labels pipeline using scikit-learn ONNX model\n",
        "sklabels = Labels((model, tokenize), dynamic=False)\n",
        "\n",
        "# Add labels to model\n",
        "sklabels.pipeline.model.config.id2label = {0: \"sadness\", 1: \"joy\", 2: \"love\", 3: \"anger\", 4: \"fear\", 5: \"surprise\"}\n",
        "sklabels.pipeline.model.config.label2id = dict((v, k) for k, v in sklabels.pipeline.model.config.id2label.items())\n",
        "\n",
        "# Run test query using model\n",
        "query(model, tokenize, None)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "What a wonderful goal! 0.909473717212677\n",
            "I didn t see that coming 0.47113093733787537\n",
            "that caught me off guard 0.42067453265190125\n",
            "i feel bad 0.019547615200281143\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d-y8gFJwCwKN"
      },
      "source": [
        "txtai can use a standard text classification model for similarity queries, where the label(s) are a list of fixed queries. The output above shows the best results for the query \"joy\"."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cbqwX7GgKBkf"
      },
      "source": [
        "# Train a PyTorch model\n",
        "\n",
        "The next section defines a simple PyTorch text classifier. The transformers library has a trainer package that supports training PyTorch models, assuming some standard conventions/naming is used. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 239
        },
        "id": "k8PkTlBLKBTy",
        "outputId": "4f48bfb2-2f16-45e3-d3e6-f1a2a747fd09"
      },
      "source": [
        "# Set predictable seeds\n",
        "import os\n",
        "import random\n",
        "import torch\n",
        "\n",
        "import numpy as np\n",
        "\n",
        "from torch import nn\n",
        "from torch.nn import CrossEntropyLoss\n",
        "from transformers import AutoConfig, AutoTokenizer\n",
        "\n",
        "from txtai.models import Registry\n",
        "from txtai.pipeline import HFTrainer\n",
        "\n",
        "from transformers.modeling_outputs import SequenceClassifierOutput\n",
        "\n",
        "def seed(seed=42):\n",
        "    random.seed(seed)\n",
        "    os.environ['PYTHONHASHSEED'] = str(seed)\n",
        "    np.random.seed(seed)\n",
        "    torch.manual_seed(seed)\n",
        "    torch.cuda.manual_seed(seed)\n",
        "    torch.backends.cudnn.deterministic = True\n",
        "\n",
        "class Simple(nn.Module):\n",
        "    def __init__(self, vocab, dimensions, labels):\n",
        "        super().__init__()\n",
        "\n",
        "        self.config = AutoConfig.from_pretrained(\"bert-base-uncased\")\n",
        "        self.labels = labels\n",
        "\n",
        "        self.embedding = nn.EmbeddingBag(vocab, dimensions)\n",
        "        self.classifier = nn.Linear(dimensions, labels)\n",
        "        self.init_weights()\n",
        "\n",
        "    def init_weights(self):\n",
        "        initrange = 0.5\n",
        "        self.embedding.weight.data.uniform_(-initrange, initrange)\n",
        "        self.classifier.weight.data.uniform_(-initrange, initrange)\n",
        "        self.classifier.bias.data.zero_()\n",
        "\n",
        "    def forward(self, input_ids=None, labels=None, **kwargs):\n",
        "        embeddings = self.embedding(input_ids)\n",
        "        logits = self.classifier(embeddings)\n",
        "\n",
        "        loss = None\n",
        "        if labels is not None:\n",
        "            loss_fct = CrossEntropyLoss()\n",
        "            loss = loss_fct(logits.view(-1, self.labels), labels.view(-1))\n",
        "\n",
        "        return SequenceClassifierOutput(\n",
        "            loss=loss,\n",
        "            logits=logits,\n",
        "        )\n",
        "\n",
        "# Set seed for reproducibility\n",
        "seed()\n",
        "\n",
        "# Define model\n",
        "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\")\n",
        "model = Simple(tokenizer.vocab_size, 128, len(ds[\"train\"].unique(\"label\")))\n",
        "\n",
        "# Train model\n",
        "train = HFTrainer()\n",
        "model, tokenizer = train((model, tokenizer), ds[\"train\"], per_device_train_batch_size=8, learning_rate=1e-3, num_train_epochs=15, logging_steps=10000)\n",
        "\n",
        "# Register custom model to fully support pipelines\n",
        "Registry.register(model)\n",
        "\n",
        "# Create labels pipeline using PyTorch model\n",
        "thlabels = Labels((model, tokenizer), dynamic=False)\n",
        "\n",
        "# Determine accuracy on validation set\n",
        "results = [row[\"label\"] == thlabels(row[\"text\"])[0][0] for row in ds[\"validation\"]]\n",
        "print(\"Accuracy = \", sum(results) / len(ds[\"validation\"]))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Loading cached processed dataset at /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/cache-a983327c4471f5aa.arrow\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <div>\n",
              "      \n",
              "      <progress value='30000' max='30000' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
              "      [30000/30000 02:28, Epoch 15/15]\n",
              "    </div>\n",
              "    <table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: left;\">\n",
              "      <th>Step</th>\n",
              "      <th>Training Loss</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>10000</td>\n",
              "      <td>1.017600</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>20000</td>\n",
              "      <td>0.286200</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>30000</td>\n",
              "      <td>0.152500</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table><p>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Accuracy =  0.883\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nHQoJnrj60Pz"
      },
      "source": [
        "88% accuracy this time. Pretty good for such a simple network and something that could definitely be improved upon. \n",
        "\n",
        "Once again let's run similarity queries using this model."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "W5_NDInF5lFN",
        "outputId": "38a2c126-63e9-40dc-f309-a29826b5b937"
      },
      "source": [
        "query(model, tokenizer)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "What a wonderful goal! 1.0\n",
            "that caught me off guard 0.9998751878738403\n",
            "I didn t see that coming 0.7328283190727234\n",
            "i feel bad 5.2972134609891875e-19\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KmcsdIltDTwj"
      },
      "source": [
        "Same result order as with the scikit-learn model with scoring variations which is expected given this is a completely different model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-fNTi2jb68rv"
      },
      "source": [
        "# Pooled embeddings\n",
        "\n",
        "The PyTorch model above consists of an embeddings layer with a linear classifier on top of it. What if we take that embeddings layer and use it for similarity queries? Let's give it a try."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "J1yhfHKC7N7L",
        "outputId": "11567948-769a-44df-9057-9fe9837a73dd"
      },
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "class SimpleEmbeddings(nn.Module):\n",
        "    def __init__(self, embeddings):\n",
        "        super().__init__()\n",
        "\n",
        "        self.embeddings = embeddings\n",
        "\n",
        "    def forward(self, input_ids=None, **kwargs):\n",
        "        return (self.embeddings(input_ids),)\n",
        "\n",
        "embeddings = Embeddings({\"method\": \"pooling\", \"path\": SimpleEmbeddings(model.embedding), \"tokenizer\": \"bert-base-uncased\"})\n",
        "print(embeddings.similarity(\"mad\", [\"Glad you found it\", \"Happy to see you\", \"I'm angry\"]))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[(2, 0.8323876857757568), (1, -0.11010512709617615), (0, -0.16152513027191162)]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0kTUEIcmBNuV"
      },
      "source": [
        "Definitely looks like the embeddings have stored knowledge. Could these embeddings be good enough to build a semantic search index, especially for sentiment based data, given the training dataset? Possibly. It certainly would run faster than a standard transformer model (see below). "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V7nAl3WtkBNK"
      },
      "source": [
        "# Train a transformer model and compare accuracy/speed\n",
        "\n",
        "Let's train a standard transformer sequence classifier and compare the accuracy/speed between the two. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 274
        },
        "id": "46fMiJrAIBu4",
        "outputId": "f0512cf8-3bc2-41ed-caff-e1541403f2a5"
      },
      "source": [
        "train = HFTrainer()\n",
        "model, tokenizer = train(\"microsoft/xtremedistil-l6-h384-uncased\", ds[\"train\"], logging_steps=2000)\n",
        "\n",
        "tflabels = Labels((model, tokenizer), dynamic=False)\n",
        "\n",
        "# Determine accuracy on validation set\n",
        "results = [row[\"label\"] == tflabels(row[\"text\"])[0][0] for row in ds[\"validation\"]]\n",
        "print(\"Accuracy = \", sum(results) / len(ds[\"validation\"]))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Loading cached processed dataset at /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/cache-98b7ef31bf6ca944.arrow\n",
            "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/xtremedistil-l6-h384-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <div>\n",
              "      \n",
              "      <progress value='6000' max='6000' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
              "      [6000/6000 07:13, Epoch 3/3]\n",
              "    </div>\n",
              "    <table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: left;\">\n",
              "      <th>Step</th>\n",
              "      <th>Training Loss</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>2000</td>\n",
              "      <td>0.635500</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>4000</td>\n",
              "      <td>0.281700</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>6000</td>\n",
              "      <td>0.192600</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table><p>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Accuracy =  0.926\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ycvozGzPmlbS"
      },
      "source": [
        "As expected, the accuracy is better. The model above is a distilled model and even better accuracy can be obtained with a model like \"roberta-base\" with the tradeoff being increased training/inference time. \n",
        "\n",
        "Speaking of speed, let's compare the speed of these models."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nWQMRQm0NwdN",
        "outputId": "4a49406c-b4eb-46b1-edab-de01c15fdccb"
      },
      "source": [
        "import time\n",
        "\n",
        "# Test inputs\n",
        "inputs = ds[\"test\"][\"text\"]\n",
        "print(\"Testing speed of %d items\" % len(inputs))\n",
        "\n",
        "start = time.time()\n",
        "r1 = sklabels(inputs, multilabel=None)\n",
        "print(\"TF-IDF + Logistic Regression time =\", time.time() - start)\n",
        "\n",
        "start = time.time()\n",
        "r2 = thlabels(inputs)\n",
        "print(\"PyTorch time =\", time.time() - start)\n",
        "\n",
        "start = time.time()\n",
        "r3 = tflabels(inputs)\n",
        "print(\"Transformers time =\", time.time() - start, \"\\n\")\n",
        "\n",
        "# Compare model results\n",
        "for x in range(5):\n",
        "  print(\"index: %d\" % x)\n",
        "  print(r1[x][0])\n",
        "  print(r2[x][0])\n",
        "  print(r3[x][0], \"\\n\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Testing speed of 2000 items\n",
            "TF-IDF + Logistic Regression time = 1.0483319759368896\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "PyTorch time = 2.0001697540283203\n",
            "Transformers time = 13.71584439277649 \n",
            "\n",
            "index: 0\n",
            "(0, 0.7258279323577881)\n",
            "(0, 1.0)\n",
            "(0, 0.998375654220581) \n",
            "\n",
            "index: 1\n",
            "(0, 0.854256272315979)\n",
            "(0, 1.0)\n",
            "(0, 0.9983494281768799) \n",
            "\n",
            "index: 2\n",
            "(0, 0.6306578516960144)\n",
            "(0, 0.9999700784683228)\n",
            "(0, 0.9982945322990417) \n",
            "\n",
            "index: 3\n",
            "(1, 0.554378092288971)\n",
            "(1, 0.9998960494995117)\n",
            "(1, 0.99846351146698) \n",
            "\n",
            "index: 4\n",
            "(0, 0.8961835503578186)\n",
            "(0, 1.0)\n",
            "(0, 0.9984095692634583) \n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1YMTyqIWDiOB"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook showed how frameworks outside of Transformers and ONNX can be used as models in txtai.\n",
        "\n",
        "As seen in the section above, TF-IDF + Logistic Regression is 16 times faster than a distilled Transformers model. A simple PyTorch network is 8 times faster. Depending on your accuracy requirements, it may make sense to use a simpler model to get better runtime performance."
      ]
    }
  ]
}
